SOC2069 Quantitative Methods

Week 4 Worksheet

Learning outcomes

By the end of the session, you should be familiar with:

  • running simple and multiple linear regression in JASP
  • performing a correlation analysis in JASP
  • model building in JASP
  • the interpretation of linear regression coefficients

Intro

We continue where we left off last week, building on Week 3 Worksheet - Exercise 2, in which we made a scatter plot of inequality by social trust using the Trust & Inequality (trust_inequality.dta) dataset, which can be downloaded from https://cgmoreh.github.io/SOC2069-QUANT/Data/.

In that exercise we simplified the default output by removing the univariate distributions of the variables displayed on the margins and the regression line cutting through the plot. Now, however, we will focus on understanding what that “regression line” is actually telling us.

Exercise 1: From a regression line to regression coefficients

If you did not download it last week, download the Trust & Inequality (trust_inequality.dta) dataset from https://cgmoreh.github.io/SOC2069-QUANT/Data/.

Task 1.1: Visualise the relationship

As a first step, create a scatter plot visualising the “relationship” (co-variation, joint distribution, …) between social trust (trust_pct) and inequality (inequality_s80s20). This is Exercise 2 from Week 3 - if you need a reminder of how to do it, check Week 3 Worksheet - Exercise 2 or your saved .jasp file containing your workshop analysis from Week 3.

Task 1.2: Model the relationship

Now let’s dig deeper into the meaning of the regression line by building a simple bivariate linear regression model of social trust as a function of societal inequality (i.e. a model aiming to explain/predict values of social trust in various countries depending on the value of societal inequality in those countries).

To build a linear regression model in JASP, click through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box and the “inequality” variable to the \(\text{Covariates}\) box.

The results from the linear regression model will appear in the outputs window on the right.

Task 1.3: Interpret the regression model output

In general terms, the coefficient of interest (the one associated with the independent variable) tells us that a one-unit difference/change on the independent variable scale is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient.
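In symbols, this statement can be sketched as follows. The notation is assumed here for illustration and may differ from the lecture slides: \(b_0\) is the intercept, \(b_1\) the coefficient of interest, \(x\) a value of the independent variable and \(\hat{y}\) the predicted value of the dependent variable.

\[ \hat{y} = b_0 + b_1 x \]

\[ \big(b_0 + b_1 (x + 1)\big) - \big(b_0 + b_1 x\big) = b_1 \]

So two cases that differ by one unit on \(x\) are predicted to differ by \(b_1\) on the dependent variable, whatever the starting value of \(x\).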

But what does this mean substantively in the context of our two variables?

Questions

  • Using the lecture slides and Chapter 7 (“Linear regression with a single predictor”) from the Introduction to Modern Statistics (IMS), interpret the meaning of the regression coefficient on “inequality”.
  • Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretation there. [Tip: You’ve already practiced adding notes to the outputs in Week 2, Exercise 3, Point 7]
  • Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))?

Task 1.4: Find the correlation coefficient using a “correlation” test instead

To run a simple bivariate correlation analysis in JASP, go through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Correlation} \]

Move both of the variables of interest to the \(\text{Variables}\) box.

Check whether the results are the same as those obtained using linear regression.

Exercise 2: Linear regression with categorical predictors

Now we will build another simple bivariate regression model, but this time we will use the variable Region to model/explain/predict levels of “social trust” in different countries. Region is the only Nominal categorical variable in this dataset, and categorical variables behave differently in regression models.

Task 2.1: Describe the Region variable using a Frequency table

Tip: You have done this a few times in previous workshops. Check back on previous exercises if you need to remind yourself of how to create a frequency table.

Task 2.2: Build a simple bivariate regression model

The steps for fitting the regression are very similar to those in the previous exercise, with one difference:

  • Click through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

  • In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box
  • BUT THIS TIME, we will move the Region variable to the \(\text{Factors}\) box instead.

This tells JASP that the Region variable is categorical and should be modelled as such: each of its categories is treated as a separate factor/indicator variable, and the first category (Task 2.1 above will tell you which one that is!) is automatically left out of the model. The left-out category becomes the baseline/reference against which the coefficients on all the other categories are compared. In effect, the left-out category is absorbed into the “Intercept”, which in this model is the predicted value of the dependent variable for countries in the reference category.
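A minimal Python sketch of this reference-category logic follows. The region labels and trust values are invented for illustration, and taking the alphabetically first label as the reference is an assumption of this sketch; JASP uses the first level of the variable as listed in the data. With a single categorical predictor, the least-squares fit reproduces the group means, so the coefficients can be computed from those means directly.

```python
from statistics import mean

# Invented toy data; region labels and trust values are hypothetical.
region = ["East", "East", "North", "North", "West", "West"]
trust = [20.0, 30.0, 60.0, 70.0, 40.0, 50.0]

levels = sorted(set(region))   # assumption of this sketch: alphabetical order
reference = levels[0]          # the left-out (baseline) category

# Mean of the dependent variable within each category
group_mean = {
    level: mean([t for r, t in zip(region, trust) if r == level])
    for level in levels
}

# Intercept = predicted value for the reference category
intercept = group_mean[reference]

# Each coefficient = difference between a category's mean and the reference mean
coefficients = {level: group_mean[level] - intercept for level in levels[1:]}

print("Reference category:", reference)
print("Intercept:", intercept)
print("Coefficients:", coefficients)
```

Here the intercept is the mean trust of the reference category, and each coefficient says how much higher or lower the mean of a listed category is compared with that reference.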

The results from the linear regression model will appear in the outputs window on the right.

Task 2.3: Interpret the regression model

In the case of numeric Scale-type predictor/independent variables the interpretation of the coefficient (“unstandardized”) was that a one-unit difference/change on the independent variable scale is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient. When the predictor/independent variable is categorical, the interpretation changes somewhat. The coefficient associated with the listed category/level of the independent variable compares that category with the reference/baseline category. In other words, the unit of difference in this case is the difference between the stated and the reference category: being in the listed category as opposed to being in the reference category is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient.

But what does this mean substantively in the context of our two variables?

Questions

  • Using the lecture slides and the assigned readings from Introduction to Modern Statistics (IMS), interpret the meaning of the regression coefficients on each reported level of the Region variable;
  • Which one is the “reference”/“baseline” category?
  • Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretation there. [Tip: You’ve already practiced adding notes to the outputs in Week 2, Exercise 3, Point 7]
  • Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))? Are they meaningful in this context? Why so, or why not?

Exercise 3: Build a multiple regression model

We can now combine the separate bivariate analyses in the previous two exercises into a more elaborate multiple regression model. The procedure to build a multiple regression model is the same as in the simple regression models before, but this time we add both of the independent variables into the model:

\[ \text{MENU TABS: } \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

  • In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box
  • Move the “inequality” variable to the \(\text{Covariates}\) box
  • Move the Region variable to the \(\text{Factors}\) box

The results will appear in the outputs window on the right. We now have a statistical model which explains variation in “social trust” as a function not only of “inequality” but also of “Region”. Put differently, if our main aim is to estimate how “inequality” is associated with “social trust”, we have obtained a more accurate estimate of that association, one that also accounts for variation due to differences in the Region to which countries belong.

Another way in which this is often expressed is that the stated coefficients are those obtained after we hold constant, or remove the effect of, the other variables in the model. This procedure is expected to give us more accurate estimates because, by including further variables in the model, we have removed them from the pool of “unknown” factors affecting/related to our outcome measurement of interest.

Questions

  • Using the lecture slides and the assigned readings from Introduction to Modern Statistics (IMS), interpret the meaning of each regression coefficient, comparing them with the ones obtained from the simpler models in the previous exercises;
  • Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretations there.
  • Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))? Are they meaningful in this context? Why so, or why not?

Exercise 4 (take-home): Which of the assignment research questions could be addressed using a linear regression model?

Exercise 4: Continue your analysis for Assignment 1

Let’s look again at the assignment research questions. Some of these questions imply a dependent variable which is measured as a numeric scale or at least a long-ish (e.g. 7-point or longer) ordinal scale in one of the surveys we will use for the assignment (ESS10, WVS7, EVS2017). Other questions imply dependent variables that are more strictly categorical, and as such, we cannot model them using linear regression. For those, we may be able to apply another model type that better fits that kind of outcome variable (e.g. logistic regression); we will cover one such model in Week 5.

In this exercise, explore the survey questionnaires (as we did in Weeks 2 and 3) to identify any available variables for answering one/some of the questions below, and check how the implied dependent variable was measured:

  1. Are religious people more satisfied with life?
  2. Are older people more likely to see the death penalty as justifiable?
  3. What factors are associated with opinions about future European Union enlargement among Europeans?
  4. Is higher internet use associated with stronger anti-immigrant sentiments?
  5. How does victimisation relate to trust in the police?
  6. What factors are associated with belief in life after death?
  7. Are government/public sector employees more inclined to perceive higher levels of corruption than those working in the private sector?

For now, you can explore the survey questionnaires available from the original survey websites, but you will shortly have them available more conveniently from the module’s website!
